68 research outputs found

    HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation

    We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

    MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

    We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data. This action has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341.

    MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

    We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.

    Czech Web Corpus 2017 (csTenTen17)

    The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS-annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: general web, Wikipedia. Time span of crawling: May, October and November 2017; October and November 2016; October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: plain text, vertical (one token per line), gzip-compressed. The vertical contains the following structures: documents (usually corresponding to web pages), paragraphs, sentences and word join markers (a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually corresponding to heading elements in the original HTML data). Block elements in the case of an HTML source, or double blank lines in the case of other source formats, were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
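The vertical format described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the corpus authors' tooling: it assumes that structural markup lines start with "<", end with ">" and contain no tab, and that every other non-empty line is a token line carrying the four tab-separated attributes in the order listed. The attribute keys (`word`, `tag`, `lempos`, `gender_lemma`) and both function names are hypothetical labels chosen for this example.

```python
import gzip

# Hypothetical keys for the four positional attributes, in the order the
# corpus description lists them: word form, morphological annotation,
# lem-POS, gender-respecting lemma.
ATTRIBUTES = ("word", "tag", "lempos", "gender_lemma")


def read_vertical(lines):
    """Yield ("markup", text) for structural lines and ("token", dict) for tokens.

    Assumption: a structural line starts with "<", ends with ">" and holds no
    tab, so a literal "<" token followed by its attributes is still a token.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            yield ("markup", line)
        else:
            fields = line.split("\t")
            # zip() tolerates lines with fewer than four attributes.
            yield ("token", dict(zip(ATTRIBUTES, fields)))


def read_vertical_gz(path):
    """Stream a gzip-compressed vertical file from a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_vertical(handle)
```

A call such as `list(read_vertical(["<structure>\n", "Praha\tTAG\tPraha-n\tPraha\n"]))` would return one markup item and one token dictionary; the tag name and the attribute values in that call are made up for illustration and are not taken from the corpus.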

    FEM Analysis of a Hybrid Synchronous Generator

    This diploma thesis deals with the calculation and design of a salient-pole synchronous generator and its subsequent conversion into a so-called hybrid synchronous machine. The thesis was prepared for Siemens Electric Machines s.r.o. in Drásov, which provided its design of a salient-pole synchronous generator. From this design, a model was created in Ansys, in both the RMxprt and Maxwell 2D environments. The designs were compared, and their calculated parameters and differences were evaluated. The model was then modified in Maxwell into a so-called hybrid synchronous generator: permanent magnets were added to the wound rotor and their influence was examined. The introduction describes the assignment in more detail. The theoretical part gives a general description of the synchronous generator and the hybrid synchronous generator. The next part describes the design of the salient-pole synchronous generator, followed by the calculation and design of the generator model in Ansys. Partial results from the model are then compared with the company's design, and some model parts and other graphical outputs from Ansys are shown. Finally, the creation of the hybrid generator model is described, and the resulting hybrid generator models are compared and evaluated.